Preparing the Data

This dataset is a combination of two datasets, one with information about the 2016 presidential election and one with information about the 2020 presidential election. Both contain the voting data for all 50 states within the United States, while the 2020 data set also contains various other metrics as well. Because of this, most analysis on metrics outside of votes will focus on the 2020 election unless otherwise stated.

One thing unique to this dataset is that the District of Columbia (D.C.) is categorized as a state for voting purposes, despite it being classified as a federal district. This is because citizens in D.C. also have voting rights in elections. Therefore, all analysis going forward will be on 51 states as the default unless otherwise stated.

Most of the numeric data contained in the dataset is also in a form incompatible for analysis with R functions (commas, percent signs, etc.), so conversion of data to compatible data types will be done on a ‘need be’ basis for each analysis.

# The following libraries will be used in the following analysis, as well as 
# options for plots

library(plotly)
library(sampling)
library(stringr)
library(tidyverse)

options(scipen = 999)

# Because data set is a combination of two data sets, you can download the new
# data set at the following link:
# https://docs.google.com/spreadsheets/d/1RXqnZ7RNXPMUKZcn_eVeZqmuUQPAOEdIEWLi865dGFY/edit?usp=sharing
# After downloading the file as a csv, rename the file to "Election_Data"

# Use function getwd() to get current directory of IDE, and move downloaded
# data file to the current directory with setwd() if necessary.

# If successful, the following line of code should work and all analysis 
# will be possible

file = read.csv("Election_Data.csv", header = TRUE)

# As mentioned earlier, most of the data contain data types unfit for R
# functions. As such, R code for converting data into suitable data types
# will precede all analysis in the corresponding code blocks.

# The original data sets come from these two following links:
# 2020: https://www.kaggle.com/imoore/2020-us-general-election-turnout-rates
# 2016: http://www.electproject.org/2016g

Votes Cast in Presidential Election

No Vote For President

How many people voted in the 2020 presidential election? These are eligible citizens that voted in the election, but chose not to vote for a president. Below is a bar plot that shows (in millions), the amount of people that voted per state in the 2020 election based on the column total ballots counted estimate.

voted2020 = as.numeric(gsub(",", "", file$Total.Ballots.Counted..Estimate.[2:52]))

plot_ly(x = file$State[2:52], y = voted2020, type = 'bar') %>%
  layout(title = "Votes Counted Per State",
         xaxis = list(tickfont = list(size = 11)))

Voted For President

How many people voted in the 2020 presidential election for a president? These are eligible citizens that voted in the election, AND also voted for a president. Below is a bar plot that shows (in millions), the amount of people that voted per state in the 2020 election. Since the data for votes for highest office president had some empty values, a linear regression was used between votes for highest office president and total ballots counted estimate to replace the blank spaces with data on the line of best fit.

pres2020 = as.numeric(gsub(",", "", file$Vote.for.Highest.Office..President.[2:52]))

total_votes = as.numeric(gsub(",", "", file$Total.Ballots.Counted..Estimate.[2:52]))

presdf = data.frame(total_votes, pres2020)
presdf = na.omit(presdf)

m1 = lm(pres2020~total_votes)

presdf = data.frame(total_votes, pres2020)
presdf["Predicted"] = round((m1$coefficients[2] * total_votes) 
                            + m1$coefficients[1], digits = 0)

plot_ly(x = file$State[2:52], y = presdf$Predicted, type = 'bar') %>%
  layout(title = "Presidential Votes Counted Per State",
         xaxis = list(tickfont = list(size = 11)))

As we can see from both graphs that they both follow a similar pattern. Checking the y-axis scale on the interactive plots, we can see that the difference between the two graphs is minimal, only around a 1% difference in counts proportional to each state. We can conclude that from both these graphs that most people who voted in the election did opt to vote for a presidential candidate as well.

Parole and Probation in 2016

Probation is defined as offenders that have been convicted (in court) of an offense, but has had their prison sentence suspended for various reasons. Prisoners on probation are generally allowed to reintegrate into society while being monitored by an assigned probation officer. Parole is also defined as offenders that have been convicted (in court) of an offense, but spends time incarcerated before being released back in society. Due to state laws differing between states, not all offer this service to those in prison. Below is a chart that shows the amount of states that give parole and probation, and ones that don’t. This data was plotted based on the parole and probation columns respectively.

parole = as.data.frame(table(file$Parole[2:52] > 0))[-1]
rownames(parole) = c("No Parole", "Parole")
a = plot_ly(x = c("No Parole", "Parole"), y = parole$Freq, type = 'bar', 
            name = "Parole") 


probation = as.data.frame(table(file$Probation[2:52] > 0))[-1]
rownames(probation) = c("No Probation", "Probation")
b = plot_ly(x = c("No Probation", "Probation"), y = probation$Freq, 
            type = 'bar', name = "Probation") 

subplot(a, b) %>% layout(title = "Prisoners Given Parole/Probation")

Generally speaking, most states that offer parole also offer probation. From the graph, we can see that there’s only 3 states that don’t follow the rule, where they only give out parole and no probation.

Examining Total Amount of Votes

Votes vs. Presidential Votes

Earlier, we concluded that there was only a 1% difference in a vote also being a presidential vote proportional to each state. We can double check this conclusion by plotting the data total ballots counted estimate and votes for highest office president against each other in a barplot.

plot_ly(x = file$State[2:52], y = presdf$Predicted, type = 'bar', name = "President Vote") %>%
  add_trace(y = voted2020, name = "Vote") %>%
  layout(title = "Presidential Votes vs. Votes Counted Per State",
         xaxis = list(tickfont = list(size = 8)))

We can now see visually that there is indeed a minimal difference between a presidential vote and a non-presidential vote for all states, so we can once again confidently conclude that most people who voted in the 2020 election did opt to vote for a president as well.

Counted Votes vs. All Votes

Citizens in the U.S. are eligible to vote in elections when they turn 18 and have no criminal charges on record (although this varies by state). However, not all citizens vote in an election for various reasons. The following bar plot compares two columns in the dataset, those who did voted in the the 2020 election (total ballots counted estimate) against those that were eligible to vote in the 2020 election (voting eligible population vep).

voting_elig = as.numeric(gsub(",", "", file$Voting.Eligible.Population..VEP.[2:52]))


tab = matrix(c(as.integer(gsub(",", "", file$Total.Ballots.Counted..Estimate.[2:52])),
               voting_elig), ncol = 2)

colnames(tab) = c('Voted_2020', 'Can_Vote_2020')
rownames(tab) = c(file$State[2:52])

tab = as.data.frame(tab)

plot_ly(x = file$State[2:52], y = tab$Voted_2020, type = 'bar', name = "Voted") %>%
  add_trace(y = tab$Can_Vote_2020, name = "Voting Eligible") %>%
  layout(title = "Votes Cast vs. Total Avaliable Votes",  
         xaxis = list(tickfont = list(size = 8)))

We can see that there is still a good amount of votes in each state that were not cast, which may or may not influence the outcome of the election. The only thing we can conclude from this that there are many citizens that are able to vote, but either made a choice not to or were unable to do so.

Voting Distribution by Year

How many people voted during each presidential election year? The following bar plot displays from all states that amount of people that voted in the 2020 (total ballots counted estimate) and 2016 presidential election. Since the original data set didn’t have a numerical value of people voting in 2016, a new column named voted2016 was created by multiplying vep turnout rate 2016 and voting eligible population vep to get those values.

turnout2016 = as.numeric(gsub("%", "", file$VEP.Turnout.Rate.2016[2:52])) * 0.01
voting_age = as.numeric(gsub(",", "", file$Voting.Eligible.Population..VEP.[2:52]))

voted2016 = round(turnout2016 * voting_age, digits = 0)

tab = matrix(c(as.integer(gsub(",", "", file$Total.Ballots.Counted..Estimate.[2:52])),
               voted2016), ncol = 2)

colnames(tab) = c('Voted_2020', 'Voted_2016')
rownames(tab) = c(file$State[2:52])

tab = as.data.frame(tab)

plot_ly(x = file$State[2:52], y = tab$Voted_2016, type = 'bar', name = "2016") %>%
  add_trace(y = tab$Voted_2020, name = "2020") %>%
  layout(title = "Voted 2016 vs. Voted 2020",  
         xaxis = list(tickfont = list(size = 8)))

We will also plot the percentage difference as well, in addition to the numerical difference to see how different the values are. The following histogram groups the differences in voter turnout percentage in 2020 (vep turnout rate 2020) and voter turnout percentage in 2016 (vep turnout rate 2016) in 1% bins. The calculation done was [2020 turnout rate - 2016 turnout rate], so a positive percentage means more percentage of people voting in 2020, and a negative percentage means more percentage of people voting in 2016.

vep2016 = as.numeric(str_replace_all(file$VEP.Turnout.Rate.2016[2:52], "%", ""))
vep2020 = as.numeric(str_replace_all(file$VEP.Turnout.Rate.2020[2:52], "%", ""))

tab2 = matrix(c(vep2020 - vep2016))
colnames(tab2) = c('Voting_Diff')
rownames(tab2) = c(file$State[2:52])

tab2 = as.data.frame(tab2)

plot_ly(tab2, x = ~Voting_Diff, type="histogram") %>%
  layout(yaxis = list(title = "# of States"), xaxis = list(title = "% Difference"),
         title = "Voting Percentage Difference Between 2016 and 2020")

We can see that overall, there were only positive percentage difference, which means that in all 51 states more people voted in the 2020 election than the 2016 election. Although the increase seems minimal at first glance, we also see that the scale of this graph is in millions of people. Therefore, visually small increases could have upwards of hundreds of thousands of people that voted depending on the population of the state.

Distributions of Felons in 2016

The right to vote in any election, specifically the presidential election in this case, differs greatly by state. Although the specifics of how long a felon has to wait before their voting rights are restored (if at all), that it outside the scope of this analysis. Instead, we will look at those who are immediately disqualified from voting in the plot below. The histogram shows the distribution of felons from the column total ineligible felons that are unable to vote for all states, grouped by bins with size 50k people each.

inelg = as.data.frame(as.integer(gsub(",", "", file$Total.Ineligible.Felon[2:52])))
colnames(inelg) = "Ineligible_Felons"
rownames(inelg) = c(1:51)

inelg_plot = plot_ly(inelg, x = ~Ineligible_Felons, name = 'Quizzes', 
                     type = 'histogram') %>% 
              layout(title = "Ineligible Felons")
inelg_plot

We can see from the histogram that a little over half of the states have less than 50k felons that are unable to vote, with a quarter of the states having between 50k and 100k felons that are unable to vote, this giving the histogram a right skew. However, there are 4 states that seem to have a very high felon count (200k and up), which are displayed below:

inelg["State"] = file$State[2:52]
inelg_tib = as.tibble(inelg)

filter(inelg_tib, Ineligible_Felons > 200000)
## # A tibble: 4 × 2
##   Ineligible_Felons State     
##               <int> <chr>     
## 1            207316 California
## 2            223139 Florida   
## 3            329754 Georgia   
## 4            492390 Texas

Depending on the population of the state, there is always a potential if these felons were able to vote that the outcome of an election would be different, but unfortunately that is outside the scope of this analysis as context data wasn’t present in the original data set.

Central Limit Theorem

In simple terms, Central Limit Theorem (CLT) states that a distribution of sample means will lean towards a normal distribution as sample size increases. We can see this theory in practice with the data. Shown below are histograms with 5000 random samples of sample sizes 5, 10, 15, and 20 using the total ballots counted estimate column in the data set. We also first display the population means, in other words the mean of the original data, to compare the means of the samples.

cat("Population Mean = ", mean(voted2020), "\n")
## Population Mean =  3114412
a = 0
b = 0
c = 0
d = 0
set.seed(6976)
for (size in c(5, 10, 15, 20)) {
  samples = numeric(5000)
  for (i in 1:5000) {
    samples[i] = mean(sample(voted2020, size, replace = FALSE))
  }
  
  if (size == 5) {
    a = plot_ly(as.data.frame(samples), x = ~samples, 
                type = "histogram", name = "Sample Size 5") %>%
      layout(xaxis = list(range = list(0, 10000000)))
  } else if (size == 10) {
    b = plot_ly(as.data.frame(samples), x = ~samples, 
                type = "histogram", name = "Sample Size 10") %>%
      layout(xaxis = list(range = list(0, 10000000)))
  } else if (size == 15) {
    c = plot_ly(as.data.frame(samples), x = ~samples, 
                type = "histogram", name = "Sample Size 15") %>%
      layout(xaxis = list(range = list(0, 10000000)))
  } else if (size == 20) {
    d = plot_ly(as.data.frame(samples), x = ~samples, 
                type = "histogram", name = "Sample Size 20") %>%
      layout(xaxis = list(range = list(0, 10000000)))
  }
  
  if (size == 5) {
    cat("Sample Size = ", size,
        " Mean = ", mean(samples),
        " SD = ", sd(samples), "\n")
  }
  else {
    cat("Sample Size = ", size,
        "Mean = ", mean(samples),
        " SD = ", sd(samples), "\n")
  }
}
## Sample Size =  5  Mean =  3075348  SD =  1367467 
## Sample Size =  10 Mean =  3109259  SD =  937794 
## Sample Size =  15 Mean =  3109983  SD =  693723.9 
## Sample Size =  20 Mean =  3118602  SD =  566620.5
subplot(a, b, c, d, nrows = 4) %>% layout(title = "Central Limit Theorem")

As we can see, when sample size is small (5 or 10), the distribution showed signs of skewing. As the sample size increased (15 or 20), the distribution showed near no skewing and instead takes the shape of a normal distribution. Regardless of the sample size, the mean of all the samples have a similar mean to the original data. Because of this, we have the option to use statistical techniques on this data that require a normal distribution as a prerequisite, which we did not have originally as shown in previous sections.

Sampling

There are many types of sampling methods that exist, all having the same purpose of identifying trends or patterns within subsets of the original data set. In this section, we examine the total ballots cast estimate column to see if we can identify any patterns using pre-existing sampling methods.

The three sampling methods shown below are simple random sampling without replacement, systematic sampling, and stratified sampling.

Simple random sampling without replacement (SRSWOR) randomly selects ‘x’ amount of values from the population, with each value having equal probability of being chosen. For this analysis, we randomly choose [x = 20] values and show the histogram of the distribution.

Systematic sampling first partitions ‘N’ amount of data into ‘n’ groups, where each group contains ‘k’ items, where ‘k’ is found by dividing ‘N’ by ‘n’, rounded up to the nearest integer. We then selected an item from the first group of ‘k’ items, and then select every kth item afterwards from the item selected from the first group. For this analysis, we have [N = 51] amount of data and choose [n = 20] groups to use. ‘k’ is then found by the calculation ‘N’/‘n’ [N = 51]/[n = 20] rounded up, which gives us 3. We now choose an item randomly from the first group, and every [k = 3]rd element after that, which the histogram of the distribution is shown. As a note, this type of sampling can introduce unintentional bias due to the way the elements are chosen.

Stratified sampling is where the data is divided into ‘N’ groups based on a certain characteristic. Then we perform simple random sampling without replacement on these subgroups (or stratums) until we have size ‘n’ samples, with the numbers of samples selected from each stratum proportional to the size of the stratum to the original data set size. For this analysis, we split the data into [N = 5] groups by counting the votes in groups of 2 million, and we want a total of [n = 20] samples.

The original data’s distribution is shown in the top left corner, simple random sampling without replacement is shown in the top right corner, systematic sampling is shown in the bottom left corner, and stratified sampling is shown in the bottom right corner. All plots shown below take their data from the total ballots counted estimate column in the dataset.

# Population

a = plot_ly(as.data.frame(voted2020), x = ~voted2020, 
                type = "histogram", name = "Population") %>%
      layout(xaxis = list(range = list(0, 18000000)))

# Simple Sampling Without Replacement

set.seed(6976)
ss_wor = srswor(20, length(file$State[2:52]))

ss_wor_vals = voted2020[ss_wor != 0]

b = plot_ly(as.data.frame(ss_wor_vals), x = ~ss_wor_vals, 
                type = "histogram", name = "SRSWOR") %>%
      layout(xaxis = list(range = list(0, 18000000)))

# Systematic Sampling

N = 51
n = 20
k = ceiling(N / n)

set.seed(6976)
r = sample(k, 1)

t = seq(r, by = k, length = n)

systematic = voted2020[t][!is.na(voted2016[t])]

c = plot_ly(as.data.frame(systematic), x = ~systematic, 
                type = "histogram", name = "Systematic") %>%
      layout(xaxis = list(range = list(0, 18000000)))

# Stratified Sampling

votes = as.data.frame(voted2020)
colnames(votes) = "Votes"
rownames(votes) = c(1:51)

votes_tib = as.tibble(votes)

votes_data = data.frame(Section = 1, filter(votes_tib, Votes > 0 & Votes < 2000000))
temp = data.frame(Section = 2, filter(votes_tib, Votes >= 2000000 & Votes < 4000000))
temp2 = data.frame(Section = 3, filter(votes_tib, Votes >= 4000000 & Votes < 6000000))
temp3 = data.frame(Section = 4, filter(votes_tib, Votes >= 6000000 & Votes < 8000000))
temp4 = data.frame(Section = 5, filter(votes_tib, Votes >= 8000000))

votes_data = rbind(votes_data, temp, temp2, temp3, temp4)

freq = table(votes_data$Section)

sizes = round(20 * freq / sum(freq), digits = 0)

set.seed(6976)
strat = strata(votes_data, stratanames = c("Section"), size = sizes, 
               method = "srswor")

strat = getdata(votes_data, strat)

d = plot_ly(as.data.frame(strat$Votes), x = ~strat$Votes, 
                type = "histogram", name = "Stratified") %>%
      layout(xaxis = list(range = list(0, 18000000)))

# Subplots

subplot(a, b, c, d, nrows = 2) %>% layout(title = "Sampling")
cat(" Population: Samples =", length(voted2016), "| Mean =", mean(voted2016), 
    "| SD =", sd(voted2016), "\n",
    "SRSWOR:     Samples =", length(ss_wor_vals), "| Mean =", mean(ss_wor_vals), 
    "| SD =", sd(ss_wor_vals), "\n",
    "Systematic: Samples =", length(systematic), "| Mean =", mean(systematic), 
    "| SD =", sd(systematic), "\n",
    "Stratified: Samples =", length(strat$Votes), "| Mean =", mean(strat$Votes), 
    "| SD =", sd(strat$Votes), "\n")
##  Population: Samples = 51 | Mean = 2808927 | SD = 2910807 
##  SRSWOR:     Samples = 20 | Mean = 3275876 | SD = 3317371 
##  Systematic: Samples = 17 | Mean = 2870429 | SD = 2521891 
##  Stratified: Samples = 19 | Mean = 3094616 | SD = 2973376

Regarding sample size, we can also see that systematic only gave us 17 samples, while stratified gave us 19 despite us wanting 20 samples. Due to the nature of how samples are chosen, there might be some unselected samples due to rounding errors.

We can see that simple random sampling without replacement, despite giving us the same number of values as the amount of sample we chose, has the highest variance away from the original data in both the mean and the standard deviation. Systematic sampling has the closest mean to the original data, with its standard deviation being a little far away. Stratified sampling has the closest standard deviation to the original, with its mean being relatively close to the original data.

If we wanted to do any of the previous analysis will the subsets produced by these sampling methods, a SRSWOR subset for analysis will most likely give us an incorrect representation of the original data due to its vastly different mean and standard deviation. Using a systematic subset will give us a good representation of the original data, but the ranges might be smaller due to the smaller standard deviations. Using a stratified subset will most likely be the best option out of the 3 sampling methods due to its similar mean and very close standard deviation to the original data.

Conclusions

The main conclusion that can be made is that more people voted in the 2020 election when compared to the 2016 one, as much as ranging from ~2% to ~14% depending on the state. Due to each state’s population being vastly different, this could be a difference of a few hundred to a few hundred thousand votes being cast.

One take away from this analysis is that despite being able to crunch numbers and plot graphs regarding the voting data from 2016 and 2020 elections, some of the analysis performed only has meaning if context clues from these elections were also included. For example, why did every state have more people voting in the 2020 election as opposed to the 2016 election? That’s not something that could be found from this dataset, but would require additional research into the situations of the times.

Historically, 2016 and 2020 were elections that generated a lot of press due to the unique circumstances of the political landscape at that time. Despite having the numbers to see the outcome of the election, one can argue that a full analysis will require context clues associated with the numbers in order to give an accurate analysis of these elections.